7 research outputs found

    Applying big data paradigms to a large scale scientific workflow: lessons learned and future directions

    The increasing amount of data related to the execution of scientific workflows has raised awareness of their shift towards parallel, data-intensive problems. In this paper, we share our experience combining traditional high-performance computing and grid-based approaches with Big Data analytics paradigms, in the context of scientific ensemble workflows. Our goal was to assess and discuss the suitability of such data-oriented mechanisms for production-ready workflows, especially in terms of scalability. We focused on two key elements in the Big Data ecosystem: the data-centric programming model, and the underlying infrastructure that integrates storage and computation in each node. We experimented with a representative MPI-based iterative workflow from the hydrology domain, EnKF-HGS, which we re-implemented using the Spark data analysis framework. We conducted experiments on a local cluster, a private cloud running OpenNebula, and the Amazon Elastic Compute Cloud (Amazon EC2). We analysed the results to synthesize the lessons learned from this experience, and discuss promising directions for further research. This work was supported by the Spanish Ministry of Economy and Competitiveness grant TIN2013-41350-P, the IC1305 COST Action “Network for Sustainable Ultrascale Computing Platforms” (NESUS), and the FPU Training Program for Academic and Teaching Staff Grant FPU15/00422 of the Spanish Ministry of Education.
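
    The abstract above describes re-casting an MPI-based iterative ensemble workflow as Spark operations. Below is a minimal PySpark sketch of that idea, assuming hypothetical run_model and enkf_update stand-ins for the forecast kernel and the EnKF analysis step; it illustrates the pattern and is not the actual EnKF-HGS code.

    ```python
    # Minimal PySpark sketch of an iterative ensemble workflow.
    # `run_model` and `enkf_update` are hypothetical placeholders for the
    # simulation kernel and the EnKF analysis step (not actual EnKF-HGS code).
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "ensemble-workflow-sketch")

    def run_model(state):
        """Forecast step: advance one ensemble member (toy dynamics)."""
        return state + np.random.normal(0.0, 0.1, size=state.shape)

    def enkf_update(states, observation):
        """Analysis step: nudge every member toward the observation."""
        mean = np.mean(states, axis=0)
        return [s + 0.5 * (observation - mean) for s in states]

    ensemble = sc.parallelize([np.zeros(4) for _ in range(64)], numSlices=8)

    for step in range(10):
        # Forecast: run the kernel on all ensemble members in parallel.
        forecast = ensemble.map(run_model)
        # Analysis: gather the members, assimilate, then redistribute.
        observation = np.full(4, float(step))
        ensemble = sc.parallelize(enkf_update(forecast.collect(), observation),
                                  numSlices=8)

    print(ensemble.first())
    sc.stop()
    ```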

    Efficient design assessment in the railway electric infrastructure domain using cloud computing

    Nowadays, railway infrastructure designers rely heavily on computer simulators and expert systems to model, analyze and evaluate potential deployments prior to their installation. This paper presents the railway power consumption simulator model (RPCS), a cloud-based model for the design, simulation and evaluation of railway electric infrastructures. This model integrates the parameters of an infrastructure within a search engine that generates and evaluates a set of simulations to achieve optimal designs, according to a given set of objectives and restrictions. The knowledge of the domain is represented as an ontology that translates the elements of the infrastructure into an electric circuit, which is simulated to obtain a wide range of electric metrics. In order to support the execution of thousands of scenarios in a scalable, efficient and fault-tolerant manner, this paper introduces an architecture to deploy the model in a cloud environment, and a dimensioning model to find the types and number of instances that maximize performance while minimizing externalization costs. The resulting model is applied to a particular case study, allowing the execution of over one thousand concurrent experiments in a virtual cluster on the Amazon Elastic Compute Cloud. This work has been partially funded under grant TIN2013-41350-P of the Spanish Ministry of Economy and Competitiveness, and the COST Action IC1305 “Network for Sustainable Ultrascale Computing Platforms” (NESUS).
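
    As a toy illustration of the dimensioning problem mentioned above, the sketch below exhaustively searches for the instance type and count that complete a batch of simulations before a deadline at minimum cost. The instance catalogue, throughput rates, and prices are invented for the example; they are not taken from the paper.

    ```python
    import math

    # Invented instance catalogue: name -> (simulations per hour per instance,
    # hourly price in USD). Real figures would come from benchmarks and the
    # provider's price list.
    INSTANCE_TYPES = {
        "small":  (40, 0.085),
        "medium": (85, 0.170),
        "large":  (170, 0.340),
    }

    def dimension(num_simulations, deadline_hours, max_instances=64):
        """Return the cheapest (type, count, cost) meeting the deadline."""
        best = None
        for name, (rate, price) in INSTANCE_TYPES.items():
            for count in range(1, max_instances + 1):
                hours = num_simulations / (rate * count)
                if hours > deadline_hours:
                    continue  # this configuration misses the deadline
                cost = count * price * math.ceil(hours)  # per-hour billing
                if best is None or cost < best[2]:
                    best = (name, count, cost)
        return best

    print(dimension(num_simulations=1000, deadline_hours=2.0))
    ```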

    Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY

    Convergence between high-performance computing (HPC) and big data analytics (BDA) is currently an established research area that has spawned new opportunities for unifying the platform layer and data abstractions in these ecosystems. This work presents an architectural model that enables the interoperability of established BDA and HPC execution models, reflecting the key design features that interest both the HPC and BDA communities, and including an abstract data collection and operational model that generates a unified interface for hybrid applications. This architecture can be implemented in different ways depending on the process- and data-centric platforms of choice and the mechanisms put in place to effectively meet the requirements of the architecture. The Spark-DIY platform is introduced in the paper as a prototype implementation of the proposed architecture. It preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. Spark-DIY is then analyzed in terms of performance by building a representative use case from the hydrogeology domain, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving toward hybrid HPC-BDA applications, integrating HPC simulations within a BDA environment. This work was supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under Grant TIN2016-79637-P (Towards Unification of HPC and Big Data Paradigms), in part by the Spanish Ministry of Education under the FPU15/00422 Training Program for Academic and Teaching Staff Grant, in part by the Advanced Scientific Computing Research program, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and in part by the DOE under Agreement DE-DC000122495, Program Manager Laura Biven.
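
    To make the interoperability idea concrete, here is a conceptual PySpark sketch in which one compute-heavy stage of an otherwise ordinary Spark program is handed to an external MPI kernel. The diy_kernel executable and the explicit mpirun call are hypothetical; in Spark-DIY itself the dispatch to DIY happens inside the platform, not in user code.

    ```python
    # Conceptual sketch only: the `./diy_kernel` binary is hypothetical, and
    # Spark-DIY performs this hand-off internally rather than in user code.
    import os
    import subprocess
    import tempfile

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "spark-diy-style-sketch")

    def run_mpi_kernel(partition):
        """Ship one partition to a (hypothetical) DIY/MPI kernel via files."""
        data = np.array(list(partition), dtype=np.float64)
        with tempfile.TemporaryDirectory() as tmp:
            inp = os.path.join(tmp, "in.npy")
            out = os.path.join(tmp, "out.npy")
            np.save(inp, data)
            subprocess.run(["mpirun", "-n", "4", "./diy_kernel", inp, out],
                           check=True)
            return list(np.load(out))

    rdd = sc.parallelize(range(100_000), numSlices=16)
    squares = rdd.map(lambda x: x * x)              # ordinary Spark stage
    result = squares.mapPartitions(run_mpi_kernel)  # stage delegated to MPI
    print(result.take(5))
    sc.stop()
    ```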

    Spark-DIY: A Framework for Interoperable Spark Operations with High Performance Block-Based Data Models

    This work was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness under grant TIN2016-79637-P “Towards Unification of HPC and Big Data Paradigms”; the Spanish Ministry of Education under the FPU15/00422 Training Program for Academic and Teaching Staff Grant; the Advanced Scientific Computing Research program, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357; and the DOE under Agreement No. DE-DC000122495, Program Manager Laura Biven.

    A cloudification methodology for multidimensional analysis: Implementation and application to a railway power simulator

    Many scientific areas make extensive use of computer simulations to study complex real-world processes. These computations are typically very resource-intensive and present scalability issues as experiments grow larger, even in dedicated clusters, since these are limited by their own hardware resources. Cloud computing has emerged as an option to move towards the ideal of unlimited scalability, providing virtually infinite resources, yet applications must be adapted to this new paradigm. The process of converting and/or migrating an application and its data in order to make use of cloud computing is sometimes known as cloudifying the application. We propose a generalist cloudification method based on the MapReduce paradigm to migrate scientific simulations into the cloud and provide greater scalability. We analysed its viability by applying it to a real-world railway power consumption simulator and running the resulting implementation on Hadoop YARN over Amazon EC2. Our tests show that the cloudified application is highly scalable, and that there is still a large margin to improve the theoretical model and its implementations, and to extend it to a wider range of simulations. We also propose and evaluate a multidimensional analysis tool based on the cloudified application. It generates, executes and evaluates several experiments in parallel for the same simulation kernel. The results we obtained indicate that our methodology is suitable for resource-intensive simulations and multidimensional analysis, as it improves infrastructure utilization, efficiency and scalability when running many complex experiments. This work has been partially funded under grant TIN2013-41350-P of the Spanish Ministry of Economy and Competitiveness, and the COST Action IC1305 “Network for Sustainable Ultrascale Computing Platforms” (NESUS).
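
    The following sketch illustrates the cloudification pattern in miniature: a parameter sweep over a simulation kernel expressed as a map phase (one experiment per task) and a reduce phase (aggregating results). The simulate kernel and its toy consumption formula are invented placeholders, not the real railway simulator.

    ```python
    from functools import reduce
    from itertools import product

    def simulate(params):
        """Map phase: run one experiment and emit (params, metric).
        The consumption formula is a toy stand-in for the real kernel."""
        trains, voltage = params
        consumption = trains * 120.0 / voltage
        return params, consumption

    def pick_best(a, b):
        """Reduce phase: keep the configuration with lowest consumption."""
        return a if a[1] <= b[1] else b

    # Sweep the (number of trains) x (line voltage) parameter space.
    grid = list(product(range(1, 11), (600, 750, 1500, 3000)))
    results = map(simulate, grid)      # embarrassingly parallel map phase
    best = reduce(pick_best, results)  # associative, so it parallelizes too
    print("best configuration:", best)
    ```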

    On the convergence of big data analytics and high-performance computing: a novel approach for runtime interoperability

    International Mention in the doctoral degree. Convergence between high-performance computing (HPC) and Big Data analytics (BDA) is currently an established research area that has spawned new opportunities for unifying the platform layer and data abstractions in these ecosystems. This thesis builds on the hypothesis that HPC-BDA convergence at the platform level can be attained by enabling runtime interoperability in a way that preserves BDA platform usability and productivity, exploits HPC scalability and performance, and expands both BDA and HPC capabilities to cope with prospective hybrid application models. The goal is to architect an abstract system that enables the interoperability of established BDA and HPC runtimes. In order to exploit the benefits of BDA data-centric paradigms, this thesis presents a data-centric transformation methodology that allows process-centric workloads to interact with BDA platforms and storage infrastructures. Furthermore, an architecture to achieve full runtime interoperability is proposed. It reflects the key design features that interest both the HPC and BDA communities, and includes an abstract data collection and operational model that generates a unified interface for hybrid applications. It also incorporates a mechanism to transfer each stage of the application to the appropriate runtime. This architecture can be implemented in different ways depending on the process- and data-centric runtimes of choice, and the mechanisms put in place to effectively meet the requirements of the architecture. The Spark-DIY platform is introduced as a possible implementation. It preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. Finally, these solutions are analysed in terms of performance by applying them to a representative use case, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving towards hybrid HPC-BDA applications, integrating HPC simulations within a BDA environment. Other auxiliary use cases, such as an application from the railway domain and a BDA benchmark operator, are also introduced to support other specific contributions of this thesis. Doctoral Program in Computer Science and Technology, Universidad Carlos III de Madrid. Committee: President: Laurent Lefevre; Secretary: David Expósito Singh; Member: Mª de los Santos Pérez Hernández.
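
    One mechanism described above transfers each stage of a hybrid application to the appropriate runtime. The sketch below shows that routing idea in miniature: each stage is tagged with a target runtime and a dispatcher hands it over in order. Stage names and runner functions are invented for illustration; the real mechanism operates inside the platform layer.

    ```python
    # Toy stage-routing sketch; the runners only print where a stage would go.
    def run_on_spark(stage, data):
        print(f"[spark] data-centric stage '{stage}'")
        return data  # a real runner would submit a Spark job here

    def run_on_mpi(stage, data):
        print(f"[mpi] process-centric stage '{stage}'")
        return data  # a real runner would launch an MPI kernel here

    PIPELINE = [
        ("ingest",   run_on_spark),  # I/O-heavy, data-centric
        ("forecast", run_on_mpi),    # tightly coupled simulation kernel
        ("analysis", run_on_spark),  # collective statistics over outputs
    ]

    def execute(data):
        for stage, runner in PIPELINE:
            data = runner(stage, data)
        return data

    execute([1, 2, 3])
    ```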

    Novel Approaches Toward Scalable Composable Workflows in Hyper-Heterogeneous Computing Environments

    The annual Workshop on Workflows in Support of Large-Scale Science (WORKS) is a premier venue for the scientific workflow community to present the latest advances in research and development on the many facets of scientific workflows throughout their life cycle. The Lightning Talks at WORKS focus on describing a novel tool, scientific workflow, or concept; these works in progress address emerging technologies and frameworks to foster discussion in the community. This paper summarizes the lightning talks at the 2023 edition of WORKS, covering five topics: leveraging large language models to build and execute workflows; developing a common workflow scheduler interface; scaling uncertainty workflow applications on exascale computing systems; evaluating a transcriptomics workflow for cloud vs. HPC systems; and best practices in migrating legacy workflows to workflow management systems.